§UNIC — Unicode Text Segmentation Algorithms
A component of unic: Unicode and Internationalization Crates for Rust.
This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting the boundaries of text elements such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences (the last is not implemented yet).
§Examples
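use unic_segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};

// Each base letter stays together with its combining marks.
assert_eq!(
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);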
assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);
assert_eq!(
    GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, "a̐"), (3, "é"), (6, "ö̲"), (11, "\r\n")]
);
fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}
assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);
// Note the two spaces before "fox": each one is reported as its own segment.
assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);
assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"),
        (3, ","),
        (4, " "),
        (5, "it's"),
        (9, " "),
        (10, "29.3"),
        (14, "°"),
        (16, "F"),
        (17, "!")
    ]
);
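The offsets paired with each segment in the GraphemeIndices and WordBoundIndices examples are UTF-8 byte positions, not character counts. The following std-only sketch shows where numbers such as 3, 6, and 11 come from. It does not call this crate: the clusters are split by hand, and `byte_offsets` is a hypothetical helper introduced only for illustration.

```rust
/// Pair each segment with its starting byte offset, assuming the segments
/// concatenate (in order) to the original string.
/// Hypothetical helper for illustration; not part of unic-segment.
fn byte_offsets<'a>(segments: &[&'a str]) -> Vec<(usize, &'a str)> {
    let mut offset = 0;
    let mut out = Vec::with_capacity(segments.len());
    for seg in segments {
        out.push((offset, *seg));
        // `str::len` returns the length in UTF-8 bytes.
        offset += seg.len();
    }
    out
}

fn main() {
    // Hand-split clusters: 'a' + U+0310, 'e' + U+0301, 'o' + U+0308 + U+0332, CRLF.
    // Each combining mark used here takes 2 bytes in UTF-8.
    let clusters = ["a\u{310}", "e\u{301}", "o\u{308}\u{332}", "\r\n"];
    let indexed = byte_offsets(&clusters);
    assert_eq!(
        indexed,
        vec![
            (0, "a\u{310}"),
            (3, "e\u{301}"),
            (6, "o\u{308}\u{332}"),
            (11, "\r\n"),
        ]
    );
}
```

Because the offsets are byte positions, they can be used directly to slice the original `&str`.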
Structs§
- Cursor-based segmenter for grapheme clusters.
- External iterator for grapheme clusters and byte offsets.
- External iterator for a string’s grapheme clusters.
- External iterator for word boundaries and byte offsets.
- External iterator for a string’s word boundaries.
- An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.
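Conceptually, the word iterator described above keeps only those word-boundary segments that pass a filter. The std-only sketch below applies the `has_alphanumeric` predicate from the Examples section to a hand-written list of segments. `keep_words` is a hypothetical helper, and whether the crate implements its iterator exactly this way is an assumption. Note also that `char::is_alphanumeric` is std's approximation of the Alphabetic and Number properties.

```rust
/// Predicate from the Examples section: true if the segment contains at
/// least one alphanumeric character.
fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

/// Keep only the segments that look like words.
/// Hypothetical helper for illustration; not part of unic-segment.
fn keep_words<'a>(bounds: &[&'a str]) -> Vec<&'a str> {
    bounds.iter().copied().filter(has_alphanumeric).collect()
}

fn main() {
    // Segments as a word-boundary pass might report them (hand-written here).
    let bounds = ["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", "fox"];
    assert_eq!(keep_words(&bounds), ["The", "quick", "brown", "fox"]);
}
```

Whitespace and punctuation segments are dropped because none of their characters are alphanumeric, while segments such as "can't" or "32.3" would be kept whole.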
Enums§
- An error return indicating that not enough content was available in the provided chunk to satisfy the query, and that more content must be provided.
Constants§
- UNIC component description.
- UNIC component name.
- UNIC component version.
- The Unicode version of data